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Abstract 

Background The structure of molecular networks derives from dynamical processes on evolutionary time scales. 
For protein interaction networks, global statistical features of their structure can now be inferred consistently from 
several large-throughput datasets. Understanding the underlying evolutionary dynamics is crucial for discerning 
random parts of the network from biologically important properties shaped by natural selection. 
Results We present a detailed statistical analysis of the protein interactions in Saccharomyces cerevisiae based 
on several large-throughput datasets. Protein pairs resulting from gene duplications are used as tracers into the 
evolutionary past of the network. From this analysis, we infer rate estimates for two key evolutionary processes 
shaping the network: (i) gene duplications and (ii) gain and loss of interactions through mutations in existing 
proteins, which are referred to as link dynamics. Importantly, the link dynamics is asymmetric, i.e., the evolutionary 
steps are mutations in just one of the binding parters. The link turnover is shown to be much faster than gene 
duplications. Both processes are assembled into an empirically grounded, quantitative model for the evolution of 
protein interaction networks. 

Conclusions According to this model, the link dynamics is the dominant evolutionary force shaping the statistical 
structure of the network, while the slower gene duplication dynamics mainly affects its size. Specifically, the model 
predicts (i) a broad distribution of the connectivities (i.e., the number of binding partners of a protein) and (ii) 
correlations between the connectivities of interacting proteins, a specific consequence of the asymmetry of the 
link dynamics. Both features have been observed in the protein interaction network of S. cerevisiae. 



Background 

Molecular interaction networks are ubiquitous in bi- 
ological systems. Examples include transcription 
control [1] , signal transduction, and metabolic path- 
ways [2]. These networks have become a focus of 



recent research, because of their important roles in 
metabolism, gene expression, and information pro- 
cessing. Data on such networks are rapidly accu- 
mulating, massively aided by high-throughput ex- 
periments. Some of these networks are sufficiently 
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complex that their characterization requires statis- 
tical analysis, an area of considerable recent inter- 
est [3-5]. One key issue in this area is the distinc- 
tion between structures reflecting biological function 
and those arising by chance. To address this issue 
requires an understanding of the biological processes 
that shape the network on evolutionary time scales. 
More precisely, one has to identify the statistical ob- 
servables containing specific information about the 
evolutionary dynamics that shape a network. 

In this paper we focus on protein interaction 
networks, whose nodes correspond to proteins, and 
whose links correspond to physical interactions be- 
tween two proteins. Several complementary exper- 
imental techniques have been used to analyze pair- 
wise protein and domain interactions, as well as pro- 
tein complexes, in genome-scale assays [6-13]. Com- 
mon to these approaches is a high rate of individual 
false negative and false positive interactions [14,15]. 
Different protein interaction data sets thus differ in 
many ways, but they also reveal similar aggregate 
(or global) network features, such as the fraction of 
nodes with a given connectivity. This implies that 
only large-scale statistical features of protein inter- 
action networks can currently be reliably identified 
by high-throughput approaches. We here present 
an empirically grounded model that explains empir- 
ically observed statistical features of such networks. 

The currently best characterized protein inter- 
action network is that of the baker's yeast Saccha- 
romyces cerevisiae. On evolutionary time scales, this 
network changes through two processes, illustrated 
by figure 1. These are (i) modifications of inter- 
actions between existing proteins and (ii) the intro- 
duction of new nodes and links through gene duplica- 
tions. Duplications of a single gene result in a pair of 
nodes with initially identical binding partners. Seg- 
mental and global duplications of the genome lead 
to the simultaneous duplication of many genes. On 
the other hand, processes affecting the interactions 
between existing proteins are referred to as link dy- 
namics. Link dynamics results primarily from point 
mutations leading to modifications of the interface 
between interacting proteins [16]. Both kinds of pro- 
cesses, link dynamics and gene duplications, can be 
inferred from a statistical analysis of the network 
data, and their rates can be estimated consistently 
with independent information. 

Of course, proteome function in vivo is influenced 
by further factors, notably gene regulation, which 
determines the concentrations of the proteins inter- 



acting in a living cell. The very definition of a bound 
state depends on the concentrations of the binding 
partners: A pair of proteins which binds at high con- 
centrations may no longer form a bound state at 
lower concentrations. Here we concentrate on pro- 
tein interactions at constant concentrations as they 
can be inferred from high-throughput datasets. 

Previous work by others [17-19] shows how struc- 
tural features of the network can in principle be 
explained through mathematical models of network 
evolution based on gene duplications alone. (For 
similar duplication-based models of regulatory and 
metabolic networks, see [20,21].) However, the over- 
all rate of link dynamics has been estimated from 
empirical data in [22] and is at least an order of 
magnitude higher than the growth rate of the net- 
work due to gene duplications. It must therefore be 
included in any consistent evolutionary model. 

In this paper, we present a model of network evo- 
lution that is based on observed rates of link and du- 
plication dynamics. At these rates, the model pre- 
dicts that important structural features of the net- 
work are shaped solely by the link dynamics. Hence, 
the evolutionary scenario of our model is quite differ- 
ent from the duplication-based models [17-19]. The 
statistical network structure predicted by the model 
is in accordance with empirical observations, see the 
discussion below. 

This paper has two parts. In the first part, we es- 
timate the rates of link attachment and detachment 
from empirical data. Specifically, we do not just es- 
timate average rates of link dynamics for the whole 
network, because this has been done previously [22] , 
but we show how the dependence of link attachment 
and detachment rates depends on the connectivities 
of both nodes (proteins) involved. (The connectivity 
of a protein is defined as the number of its interac- 
tion partners). We find evidence that the basic rate 
of link attachment is asymmetric. That is, this rate 
increases with the connectivity of only one of two the 
nodes involved. This reflects an asymmetry in the 
underlying biological process: a new protein-protein 
interaction is typically formed through a mutation 
in only one of two proteins. 

In the second part of the paper, we assemble the 
estimated rates of link dynamics into a model of net- 
work evolution. Unlike for most other cases studied 
so far [3,4] , the dynamics of these networks cannot be 
written as a closed equation dependent on the con- 
nectivity distribution, i.e. the fraction of nodes with 
a given number of neighbors. Instead, the analysis of 
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Figure 1: The elementary processes of protein network evolution. The progression of time is 
symbolized by arrows, (a) Link attachment and (b) link detachment occur through nucleotide sub- 
stitutions in the gene encoding an existing protein. These processes affect the connectivities of the protein 
whose coding sequence undergoes mutation (shown in black) and of one of its binding partners (shown in 
gray). Empirical data shows that attachment occurs preferentially towards partners of high connectivity, 
cf. fig. 3. (c) Gene duplication usually produces a pair of nodes (shown in black) with initially identical 
binding partners (shown in gray) . Empirical data suggests duplications occur at a much lower rate than link 
dynamics and that redundant links are lost subsequently (often in an asymmetric fashion), which affects the 
connectivities of the duplicate pair and of all its binding partners [22,25,43]. 



networks under asymmetric link dynamics involves 
the link connectivity distribution, defined as the frac- 
tion of links connecting a pair of nodes with given 
connectivities. 

The model has only one free parameter, the av- 
erage connectivity of nodes in the network. Its sta- 
tionary solution correctly predicts statistical proper- 
ties observed in the data. Central properties of this 
solution are connectivity correlations between neigh- 
boring vertices, in accordance with recent observa- 
tions in high-throughput protein interaction data 
[23]. These correlations are a consequence of the 
asymmetric link attachment process. 



Results and discussion 
Estimates of evolutionary rates 

Two kinds of processes contribute to the evolution- 
ary dynamics of protein interaction networks. The 
first consists of point mutations in a gene affecting 
the interactions of the encoded protein. As a re- 
sult, the corresponding node may gain new links or 
loses some of the existing links to other nodes, as 
illustrated in fig. 1(a) and 1(b), respectively. We re- 



fer to these attachment and detachment processes, 
which leave the number of nodes fixed, as link dy- 
namics. The second kind of process consists of gene 
duplications followed by either silencing of one of the 
duplicated genes or by functional divergence of the 
duplicates [24-26] . In terms of the protein interac- 
tion network, a gene duplication corresponds to the 
addition of a node with links identical to the original 
node, followed by the divergence of some of the now 
redundant links between the two duplicate nodes; 
see fig. 1(c). 

Individual yeast genes have been estimated to 
undergo duplication at a rate of the order of 10~ 2 
per gene and per million years [27]. Some 90% of 
single gene duplicates become silenced shortly after 
the duplication, leading to an effective rate g of du- 
plications one order of magnitude lower, i.e., <~ 10~ 3 
per million years [22,25,27,28]. Only a fraction of 
the yeast proteome is part of the protein interaction 
network, and gene duplicates involving proteins that 
are not part of the network do not contribute to its 
growth. Hence, g ~ 1CT 3 per million years should 
be considered an upper bound for the growth rate of 
the protein interaction network by gene duplications. 
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A crude lower bound for the link attachment rate is 
a ~ 10 _1 new interaction partners per node and mil- 
lion years. For instance, [22] estimated the rate at 
which new interactions were formed as no less than 
294.5 new interactions per million years and approx- 
imately 1000 proteins. (These estimates are based 
on the formation of physical interactions between 
products of duplicate genes, and the approximately 
known age of the duplicates [22] . Importantly, most 
of these new interactions form between old dupli- 
cates, duplicates that are no longer under the relaxed 
selection pressure that is characteristic of young du- 
plicates.) The above estimate gives a number of new 
interaction partners per protein per million years of 
a = 2 x 294.5/1000 = 0.589, five times greater than 
the lower bound of 0.1. To maintain an average net- 
work connectivity at the empirically observed value 
k ~ 2.5 interaction partners per protein [25,29], the 
link detachment rate d has to be close to a, thus 
d ~ a ~ 10 _1 per million years. This rate of link 
attachment and detachment is much larger than the 
duplication rate of g ~ 10 -3 per protein and million 
years. Hence, the link dynamics is decoupled from 
the much slower duplication dynamics. On interme- 
diate evolutionary time-scales, the network reaches a 
stationary state of the link dynamics, while its num- 
ber of nodes does not change significantly. This sta- 
tionary state determines the structural statistics of 
the network, in particular the distribution of connec- 
tivities. On long time-scales, however, the network 
may grow through duplications. Wc emphasize that 
all these evolutionary rates are order-of-magnitude 
estimates, and that such estimates are sufficient for 
our model and the conclusions we derive from it. 

One basic but important empirical observation 
about link dynamics is the fast loss of connectivity 
correlations of proteins encoded by duplicate genes. 
Fig. 2(a) shows this loss, as estimated from empir- 
ical data. Specifically, the figure shows the aver- 
age relative connectivity difference \k — k'\/(k + k') 
of duplicate protein pairs as a function of the time 
since duplication, parameterized by the fraction K s 
of synonymous (silent) nucleotide substitutions per 
silent site. (As an order of magnitude estimate, a 
value of K s = 0.1 corresponds to a duplication age 
of 10 million years [25,27].) In the shortest time in- 
terval after duplication, the connectivities are still 
measurably similar. Soon thereafter, however, the 
relative connectivity difference becomes statistically 
indistinguishable from that of a randomly chosen 
pair of nodes, indicated by the horizontal line in 



fig. 2(a). Hence, diversification after duplication is 
a rapid process, with a time constant of the order of 
several 10 million years, consistent with the fast rate 
of link dynamics discussed above. 

An additional empirical observation underscores 
the minor importance of gene duplication in shaping 
the observed network structure. In models of net- 
work evolution based on gene duplication [17-19], a 
protein acquires new links through duplications of 
its neighbors (see, for example, the grey nodes in 
fig. 1(c)), at a rate proportional to its connectiv- 
ity. This mechanism would generate an abundance 
of high-connectivity nodes. In addition, it would also 
generate a high fraction of pairs of neighbors that are 
products of a gene duplication. This is also true for 
intermediate models, incorporating both gene dupli- 
cations and link dynamics, provided the duplication 
rate is comparable to the rate of link dynamics, or 
exceeds it. However this prediction of models based 
on gene duplications is not supported by the data. 
Fig. 2(b) shows the fraction of duplicate protein 
pairs among the k(k — l)/2 neighbor pairs of a node 
of connectivity k. This fraction is small and it does 
not increase significantly with k. The data in this 
figure are also consistent with the earlier observation 
that the majority of duplicate pairs have few or no 
interaction partners in common [25]. 

We note that in our discussion of node dynamics 
we have not separately considered the effects of an- 
cient genome duplications [30,31]. The conclusion 
that gene duplications do not shape the statistical 
features of the protein interaction network applies 
both to single gene duplications and to genome du- 
plications. Indeed, the analysis of duplicates pre- 
sented in figure 2 includes both pairs of genes re- 
sulting from single duplications and those stemming 
from genome duplications. Furthermore, the evo- 
lutionary dynamics of individual duplicated genes 
is similar for the products of single genome and 
whole genome duplications. For example, individ- 
ual gene duplicates are lost with approximately the 
same probability in single duplications and in whole 
genome duplications. For this reason we do not, at 
this stage, include genome duplications separately in 
our model. 



Dependency of attachment rates on connectivities 

The total rates a and d at which links are attached 
and detached in a protein interaction network allow 
no inference of how these processes shape the sta- 
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Figure 2(a): Duplicate protein pairs lose their connectivity correlations over time. The av- 
erage relative connectivity difference \k — k'\/{k + k') of duplicate pairs with connectivities k, k' > is 
plotted against the time since duplication, parameterized by the synonymous (silent) nucleotide divergence 
K s . The horizontal line indicates the value expected for two randomly chosen nodes. The average number 
of duplicate pairs per bin was 16 (from low values of K s to high ones the number of duplicate pairs per 
bin were 12, 5, 3, 6, 6, 8, 13, 27, 44 respectively), (b) Duplications do not strongly influence network 
structure. The histogram shows the fraction of duplicate pairs among the k(k — l)/2 neighbor pairs of a 
node of connectivity k plotted versus k. A high number of duplicate pairs would be expected if duplications 
were a significant mechanism of link gain, see text. The mean and the standard error of this fraction were 
determined using proteins which are products of duplicate genes with sequence similarity K a < 1. The 
number of vertices used per column ranges from 374 for k = 2 to 8 for k = 12. 



tistical properties of the network. To make such an 
inference, one must also know how the link dynamics 
depends on the connectivities of the nodes involved. 
The simplest possibility is that link attachment rates 
a and detachment rates d are functions of a node's 
connectivity k. The rates and dk at which links 
are attached or detached from a node of connectivity 
k have been estimated previously using interactions 
between products of duplicate genes [22]. They in- 
crease approximately linearly with k. 

In representing attachment and detachment rates 



(a, d) as functions of connectivity k (afe, dk), one 
assumes implicitly that that the mechanism of link 
attachment and detachment is identical (symmetric) 
for the two nodes involved in a changed link. Pre- 
vious analyses of protein network evolution [22] as 
well as models of network evolution [32] were based 
on such a symmetric process. However, the biologi- 
cal mechanism underlying link dynamics is intrinsi- 
cally asymmetric. When a new link is formed, typi- 
cally only one node undergoes a mutation, whereas 
the other node remains unchanged. This asymmetry 
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means that the rate of link dynamics will generically 
depend in one way on the connectivity of the node 
undergoing mutation, and in another way on that of 
the unchanged node. As a result the rates a k and d k 
of link attachment and detachment are insufficient 
to describe the dynamics of the network, since these 
rates will be different depending on whether the node 
is undergoing a mutation or not. This observation 
motivates the following estimate of the dependency 
of the link dynamics rate on node connectivities. 

We define a k ,k' as the probability per unit time 
that a given non- interacting pair of proteins with re- 
spective connectivities k and k' will acquire a link, 
multiplied by the number of proteins N. Analo- 
gously, we define the detachment rate dk,k' as the 
probability per unit time that a given interacting 
pair of proteins with respective connectivities k and 
k' will lose their link. The scaling convention of both 
rates is chosen such that the average connectivity of 
the network remains constant as the number of nodes 
N increases: the number of nodes pairs (where a link 
may be added) is proportional to N 2 , whereas the 
total number of links (which may be deleted) is pro- 
portional to N. We refer to the special case where 
the rates factorize, i.e. dk,k' ~ flfcOfc', as symmet- 
ric attachment (and analogously for the detachment 
rates dk,k')- The specific form of these rates assumes 
that link dynamics is a local process, so the proba- 
bility for the formation or destruction of a link de- 
pends on the connectivities of only the two proteins 
involved in this process. 

We now explain how one can estimate the de- 
pendency of dk,k' on its arguments, k and k! . As 
described earlier [22] , one can use the observed num- 
ber of physical interactions among duplicate gene 
products (cross-interactions) to estimate attachment 
rates. Briefly, such cross-interactions may arise 
in two ways. First, a protein that forms homod- 
imers (a self-interacting protein) may undergo du- 
plication, leading to two identical self-interacting 
proteins which also interact with each other. If 
both self-interactions are subsequently lost indepen- 
dently, yet the interaction between the nodes is re- 
tained, a cross-interaction is formed. This scenario 
does probably not account for the majority of cross- 
interactions, because it is inconsistent with data sug- 
gesting that self-interactions do not get lost overly 
frequently after duplication [22] . The second avenue 
of forming interactions between duplicate gene prod- 
ucts involves a non-homodimerizing protein that un- 
dergoes duplication. Subsequently, an interaction 



between the duplicate proteins may form. If this 
mechanism is dominant, as we argue, one may use 
the number of cross-interactions to obtain order-of- 
magnitude estimates of the attachment rate [22]. 
From the number of interactions that each of the 
two involved proteins has with other proteins, one 
can estimate how the attachment rate depends on 
k and k' . The main caveat of this approach is that 
the connectivity of the duplicates may have changed 
since the time the link between them was formed. 

The result of this procedure is shown in fig. 3. 
The sample size of 38 cross- interactions is extremely 
limited, but sufficient to demonstrate an increase of 
the attachment rate along the diagonal k — k! , and 
no systematic change along other directions. A dif- 
ferent representation of the same data in fig. 3b) also 
shows an increase of the attachment rate consistent 
with k + k'. 

An attachment process where one node with con- 
nectivity k is chosen with a probability a\, and a 
second one is chosen with probability a\, gives an 
attachment rate a kk < ~ Uk a k' + aj.,a k . The attach- 
ment rate a kk ' ~ k + k' which we observe empirically 
is thus explained by an asymmetric attachment pro- 
cess where one node is chosen uniformly at random 
(a\ = constant), and the other node is chosen with a 
probability proportional to its connectivity (a\ ~ k) . 
Note that the rate ak.k 1 itself is symmetric under 
interchange of the labels k and k', since either of 
the two nodes may take on the role of being pref- 
erentially chosen. However, the rate ak,k' does not 
factorize, exactly as required for an asymmetric at- 
tachment process. 

We now present an additional, complementary 
approach, based on maximum likelihood analysis, 
which validates the functional form of ak,k'- The 
probability that out of n kk ' pairs of duplicates with 
given connectivities k and k', nikk' pairs interact 
is C^%{g kkl ) m ^'{l - g fcfc 0" fcfc '~ m "', where g kk , 
gives the probability for a cross-interaction. = 
n\j (m!(n — to)!) are the binomial coefficients. The 
probability p for observing for each pair k < k' m kk i 
interactions in n kk ' pairs of duplicates is then given 

byp = U k < k 'C n m %(9kk') m ^(l - 9kk') n ""'- m --'. 
Symmetric and asymmetric attachment differ in how 
the probability of a cross-interaction g kk > depends 
on k and k! . In the symmetric case, g kk > = g k g k > ■ 
In the asymmetric case where one node is chosen 
uniformly, the other with a probability fk, we have 
9kk' = fk + fk 1 ■ Using simulated annealing [33] we 
have calculated the (maximal) likelihoods p that the 
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connectivity correlation pattern shown in fig. 3a re- 
sulted from either an asymmetric process, or a sym- 
metric process, respectively, by maximizing p with 
respect to fk and gt ■ We find that the maximal like- 
lihood for asymmetric attachment exceeds that for 
symmetric attachment by a factor pasym/psym ~ 4. 
The data thus favor an asymmetric attachment pro- 
cess, consistently with the biological interpretation 
given above. In addition, in the maximum likelihood 
analysis of the asymmetric model, shows an ap- 
proximately linear increase with k (see figure 3c). 
Although this result is by no means conclusive, the 
data shows there is no reason to a priori consider 
only symmetric processes. 

Thus far, we have only discussed the link attach- 
ment rate. For the detachment of links, we analo- 
gously assume that links are lost due to mutations 
at one of two linked nodes, and that the rate of this 
process does not depend on the properties of the 
other node that is unaffected by a mutation. The 
simplest mechanism reflecting these assumptions is 
one where a protein loses on average d links per unit 
time. A protein is chosen in an equiprobable manner 
from all nodes for removal of one of its links. The 
link to be removed is chosen at random from all its 
links. (An alternative detachment process consists 
in the loss of a certain fraction of links and leads to 
very similar results.) The resulting detachment rate 
is rffe.fe' ~ (1/fc) + where the inverse terms 

stem from nodes (rather than links) being chosen 
uniformly. 

Dynamical model of network evolution 

The rates of the link dynamics discussed above, to- 
gether with a slow growth of the network due to 
duplications, define a simple model for the evolu- 
tion of protein interaction networks. Unlike previ- 
ous models of the evolution of protein interaction 
networks [17-19] which emphasize the role of gene 
duplications, our model is based on the asymmet- 
ric link dynamics deduced from empirical data in 
the preceding section. By analytical solution or by 
numerical simulation one may investigate the net- 
works generated by our model and compare their 
statistical properties to those of the empirical data 
on protein-interaction networks. This will be done 
in the present section. Before analyzing this model 
in the limit of large networks, we discuss the specific 
values of model parameters we used, and present the 
results of numerical simulations of a finite network. 



We chose the initial network size such that af- 
ter a sufficient waiting time, when a stationary state 
has been reached, the size of the simulated network 
matches that of the protein interaction data set we 
used (see methods). Duplication of nodes is mod- 
eled simply by adding new nodes with connectivity 
zero to the network at a rate of g = 1CP 3 per node 
per million years, as motivated above. Using this 
simplistic growth mechanism is appropriate since, 
as shown above, the link dynamics will quickly alter 
the initial connectivity of a new node, as well as con- 
nectivity correlations with its neighbors. We begin 
with a total number of 4600 nodes, uniformly linked 
at random (giving a Poissonian connectivity distri- 
bution) such that the average connectivity of nodes 
with non-zero connectivity is k = 2.5, the average 
connectivity found in the data set we used. After 
a waiting time of 25 million years there are 4696 
nodes in total, of which 1872 nodes have non-zero 
connectivity. This is the size of the pooled protein 
interaction data set we used. The waiting time of 
25 million years is of the same order of magnitude 
as the time scale on which connectivity correlations 
of duplicate nodes decay in figure 2a) of a few 10 
million years. 

New links are added at a rate of a = 0.59 new 
interactions per node per million years, using the 
asymmetric preferential linking rule we motivated 
above. Specifically, to form a new link we chose 
one node uniformly and a second node preferentially 
(i.e., with a probability proportional to its connec- 
tivity k) and link the two nodes. We removed links 
at a rate that keeps the average connectivity con- 
stant. Specifically, at each time-step a link is deleted 
by choosing a node uniformly for link deletion if the 
average network connectivity exceeds n — 2.5. The 
link to be deleted is chosen equiprobably from the 
links of this node. The connectivity distribution of a 
network whose evolution was simulated in this man- 
ner is shown in figure 4a) (open circles, o). This 
distribution is robust with respect to changes in the 
ratio of duplication to link dynamics g/a over at least 
an order of magnitude (results not shown) . 

We now turn to the consequences of this evo- 
lutionary dynamics for the statistical properties of 
the network. Since the link dynamics places and 
removes a link with a rate depending only on the 
connectivities of the nodes at either end, the evolu- 
tionary dynamics of the network can be represented 
in terms of the link connectivity distribution qk,k'- 
This distribution is defined as the fraction of net- 
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Figure 3: Link attachment occurs preferentially towards proteins of high connectivity, (a) The 

color-coded plot shows the fraction of duplicate pairs with connectivities (k,k') that have gained a mutual 
interaction (cross-interaction) since duplication, as a function of k and k' . Points where all duplicate pairs 
have cross-interactions are shown in white, points where none carry a cross-interactions are shown black. 
Points (particularly at high connectivities) where no data is available are also shown in black. The number 
of duplicate pairs with given connectivities ranges from 2 to 39. Points in the k, fc'-plane where only a single 
pair of duplicates exists arc excluded, (b) For this histogram the data from a) are binned for low, medium, 
and high k + k' and the average for each bin is shown against k + k'. The number of k, k 1 values contributing 
to each bin are 10, 14, and 11, from left to right. Error bars give the standard error, (c) Assuming the 
functional form fk + fk' f° r the probability of a cross-interaction between nodes with connectivities k and 
k' (asymmetric attachment), the most likely values of fk may be deduced from the data (see text). The 
maximum-likelihood result shows an approximately linear increase of fk with k. The alternative scenario, 
symmetric attachment, yields a smaller maximum likelihood. Only duplicate pairs with K a < 0.4 were used 
in this analysis in order to avoid overcounting of cross-interactions of duplicates of even older duplicates. 

work links that connect vertices of connectivities k and k', 

Qk,k' = 1^ y^^fc,fc« c »j^fc',fcj ' (1) 
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Figure 4(a): The asymmetric link dynamics produces a broad connectivity distribution. The 

model prediction of the connectivity distribution of nodes with non-zero connectivity agrees well with yeast 
protein interaction data (filled diamonds). The solution of the rate equation (0} is shown as a solid line, 
the result of a computer simulation emulating the link dynamics encapsulated in |0J for a network of finite 
size is shown as circles (o). Nodes with the highest k (lower right) occur only once in the network, (b) 
High-connectivity vertices are preferentially connected to low-connectivity vertices, as also 
observed empirically. The figure shows the relative likelihood of the link distribution q~k,k' and the 'null 
distribution' <fl k , of an uncorrelated random network, see text. 

where = 1 if node i is linked to j and oth- 
erwise. For convenience, a factor n has been in- 
cluded in the normalization, i.e., J2kk'1k,k' = k. 
The link connectivity distribution Qk,k' captures cor- 
relations between the connectivities of neighboring 
vertices [23,34-36]. It is related to the single- vertex 



connectivity distribution by 

Pk = ^2qk,k>/ k ( 2 ) 

fe' 

for k > and po = 1 — J2k>o Pk- The Ta ^ es a k,k' and 
dk,k' are related to the total rates a and d of link 



detachment per unit time by the normalization 



/ 0>k,k'PkPk' 
k,k' 



a 



(3) 



E 

k,k' 



dk,k'Qk,k' 



For a network of infinite size, link and growth dy- 
namics result in a deterministic differential equation 
for the evolution of the link connectivity distribution 

Qk,k'i 

dqk,k'/dt = a k -i,k'-lPk-lPk'-l - (d k ,k' + g)lk,k' 

— (Jk,k> — Jk-l,k') — (Jk',k — Jk'-1,0) 

The terms Jk,k> arise from links that are not added 
or removed but that change their values (fc, k'), 



JkM' = ^ a-k,k"qk,k'Pk" 
k" 



lk+l,k" + l- 



qk+l,k'Qk+l,k" + l 



Pk+i k 
(5) 

These are the links joining a mutated protein or its 
binding partner with third vertices, shown as open 
circles in fig. l(a,b). The parameter g accounts for a 
uniform increase of the number of nodes caused by 
gene duplications. 

In writing eq. Q), we have assumed that next- 
nearest neighbor connectivity correlations vanish. 
This assumption is self-consistent since the station- 
ary solution has indeed only nearest-neighbor cor- 
relations. Truncating all correlations and writing 
down an evolution equation for the connectivity dis- 
tribution pk turns out to be inconsistent since asym- 
metric link dynamics generates non-trivial connec- 
tivity correlations. This distinguishes the present 
model from simpler models of network growth, which 
can be self-consistently formulated at the level of the 
distribution pk- 

We solved equation eq. (@}, which describes the 
evolution of the connectivity correlations numeri- 
cally for its steady state. For initial conditions 
we use a Poissonian connectivity distribution where 
the average connectivity of connected nodes is 2.5, 
and connectivity correlations which factorize qk,k' ~ 
kk'pkP k > ■ We followed the time evolution of q^k 1 
defined by eq. I0J until a steady state was reached 
using the parameters a and g given above and choos- 
ing d such that the average connectivity of connected 
nodes remains at a constant K = 2.5. This procedure 
leads to a stationary link connectivity distribution 
q~k,k' and a resulting connectivity distribution pk in- 



dependent of initial conditions. Because the evolu- 
tion equation is a rate-equation that applies to a net- 
work of infinite size, the parameters determining the 
stationary state are the ratio between growth and at- 
tachment rate, the functional form of the attachment 
and detachment rates, and the average connectivity. 
The stationary state turns out to be asymptotically 
independent of the duplication rate for small dupli- 
cation rates. In fact, if we solve eq. @ numerically 
for any ratio g/a < 10 , the results are statisti- 
cally indistinguishable from that for g = 0, implying 
great robustness against errors in the rate estimates 
discussed above. 

The statistical properties of our model in its sta- 
tionary state may now be compared with the cor- 
responding quantities in the protein-interaction net- 
work. The connectivity distribution pk agrees well 
with the empirical data as shown in fig. 4(a) along 
^wi th the results of numerical simulations. The dis- 
+tlibution is broad but not scale free. ( From the 
empirical data with connectivities distributed over 
little more than a single decade the scale-free prop- 
erty of protein networks - meaning that connectivi- 
ties are distributed according to a power law - can 
not be confidently ascertained. Furthermore the em- 
pirical data shown in fig. 4 distinctly deviates from a 
power-law.) This also holds for uniform detachment, 
where dkk' = constant, and it is a crucial difference 
to models with symmetric attachment, where prefer- 
ential attachment leads to scale-free networks, both 
at constant network size [32], and in growing net- 
works [3,37]. 

For the connectivity correlations, we find that 
vertices of high k are more frequently linked to ver- 
tices of low k! than in an uncorrelated random net- 
work with the same connectivity distribution pk- 
Fig. 4(b) shows the relative likelihood qk,k'/q^k'' 
where qj? k , — kk'pkpk' /k is the link connectivity dis- 
tribution of the network with no connectivity cor- 
relations. Correlations with this property have re- 
cently been reported for the protein interaction net- 
work in yeast [23], but a quantitative comparison 
with the prediction of our model will have to await 
a greater amount of reliable protein interaction data. 
We note that connectivity correlations are a specific 
property of networks shaped by asymmetric dynam- 
ics, and are absent in the case of symmetric dynam- 
ics, as discussed in the appendix. In other words, the 
empirically observed non-trivial connectivity corre- 
lations require asymmetric link dynamics. This is an 
a posteriori reason for considering asymmetric link 
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dynamics. 

A further consequence of asymmetric attachment 
is that our model does not obey detailed balance 
(as is the case of symmetric link dynamics, where 
attachment and detachment rates do factorize, see 
[32]). Asymmetric attachment or detachment rules 
violate the condition, necessary for detailed balance, 
that the product of transition probabilities along a 
circular trajectory in the space of networks is inde- 
pendent of the direction of this tour. This may be 
demonstrated easily by considering, e.g. four nodes 
labeled 1 — 4 to be connected linearly and discon- 
nected again. Starting and ending with a single link 
between nodes (1,2), say, the product of the rates 
of adding a link between (2,3), then (3,4) before 
removing the links between (2,3) and then (3,4) 
is aoi^22<iii, that for the same tour in reverse is 
aoo«nrfi2) which are generally equal only if the rates 
facorize in their arguments. 



Conclusions 

We have presented a stochastic evolution model for 
protein networks, which is based on fast link dynam- 
ics due to mutations of the coding sequence of ex- 
isting proteins and a slow growth dynamics through 
gene duplications. The crucial ingredient of the link 
dynamics is an asymmetric preferential attachment 
rule, which is supported by empirical data. The 
asymmetry has a simple biological interpretation, 
namely that mutations in one gene may lead to a 
new interaction of its product with that of another, 
unchanged, gene. Such a mechanism, where the two 
nodes involved in the generation of a new link play 
different roles, is probably the norm, rather than the 
exception, in biological networks. This holds partic- 
ularly for regulatory networks, where a new interac- 
tion between two genes is formed by changes in the 
regulatory region of only one of them. 

Asymmetric link dynamics leads to a network 
model, where the aggregate variables necessary to 
describe network structure are the connectivity cor- 
relations qk,k' , which give the fraction of links with 
connectivities k and k! . In our case, the model 
successfully reproduces the connectivity distribution 
found in empirically available protein interaction 
data. The asymmetry of the link dynamics also leads 
to connectivity correlations between interacting pro- 
teins, which have been observed empirically [23]. A 
model with symmetric link dynamics, on the other 



hand, produces no such correlations. Higher order 
correlations of this kind [35] are of particular interest 
for future work as they may be a quantitative signa- 
ture of natural selection on the level of the network 
as a whole. 



Methods 
Data processing 

The protein interaction data in this paper was 
pooled from three sources. The first of these 
sources is a large-scale high-throughput experiment 
using the yeast two-hybrid assay [13] (data avail- 
able from [38]). It comprises 899 pairwise inter- 
actions among 985 proteins. The second source is 
also a high-throughput two-hybrid experiment, from 
which we used a "core" set of 747 interactions be- 
tween 780 proteins, interactions that had been con- 
firmed through replicated experiments [9,39]. The 
third source is the public MIPS database [40, 41] of 
May 2001. From this database, we included only 
pairwise interactions that were not produced by the 
two-hybrid assay, but instead by other techniques 
such as cross-linking or co-purification of two pro- 
teins. This resulted in 899 interactions between 680 
proteins After pooling the three data-sets and elim- 
inating redundant interactions, we were left with a 
network of 2463 interactions and 1893 proteins. 

While enormously valuable in their own right, 
analyses of protein complexes do not identify pair- 
wise protein interactions, and were thus not suit- 
able for our analysis [7,8]. We also excluded in- 
teraction data derived from experiments identifying 
domain-specific rather than whole-protein interac- 
tions [10-12]. For all three data sets taken sepa- 
rately, the connectivity distributions are statistically 
indistinguishable [22]. Moreover, the observations 
on link addition we use here [22] , as well as the pat- 
terns in Fig. 2 hold qualitatively for each data set 
individually. 

Data on yeast gene duplicates, generated as de- 
scribed in [27], was kindly provided by John Conery 
(University of Oregon, Department of Computer Sci- 
ence) . Briefly, gapped BLAST [42] was used for pair- 
wise amino acid sequence comparisons of all yeast 
open reading frames as obtained from GenBank. All 
protein pairs with a BLAST alignment score greater 
than 10 -2 were retained for further analysis. Then, 
the following conservative approach was taken to 
retain only unambiguously aligned sequences: Us- 
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ing the protein alignment generated by BLAST as 
a guide, a sequence pair was scanned to the right 
of each alignment gap. The part of the sequence 
from the end of the gap to the first " anchor" pair of 
matched amino acids was discarded. The remaining 
sequence (apart from the anchor pair of amino acids) 
was retained if a second pair of matching amino acids 
was found within less than six amino acids from the 
first. This procedure was then repeated to the left 
of each alignment gap (see [27] for more detailed de- 
scription and justification). The retained portion of 
each amino acid sequence alignment was then used 
jointly with DNA sequence information to generate 
nucleotide sequence alignments of genes. For each 
gene pair in this data set, the fraction K s of syn- 
onymous (silent) substitutions per silent site, as well 
as the fraction K a of replacement substitutions per 
replacement site were estimated using the method of 
Li [28]. 



Asymmetric link dynamics and connectivity corre- 
lations 

The existence of non-trivial correlations may be at- 
tributed directly to the asymmetry of the link dy- 
namics. Symmetric link dynamics, which is a stan- 
dard mechanism in models of networks at constant 
size [32], leads to networks with uncorrelated con- 
nectivities: Generalizing the approach of [32] to in- 
clude connectivity-dependent detachment, one ob- 
tains for symmetric link dynamics with rates ak and 
dk an equilibrium distribution giving the probabil- 
ity of finding the network in the state given by adja- 
cency matrix dj of P{{c i3 ■}) ~ Y^JIq a k /d k+1 . 
This results in a connectivity distribution pk = 
1/ (^0 Iifc'=o afc ' /<^fc'+i and trivial connectivity cor- 
relations qk,k> ~ kk'pkpk' , which factorize in the con- 
nectivities. This results in a constant <7fc,fc' /<Zfc fc' > 
where q® k , — kk'pkpk' / k. A model with sym- 
metric link dynamics can thus produce any empiri- 
cally observed connectivity distribution, but no net- 
works with statistically significant connectivity cor- 
relations. 
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